tensor compute cluster @ MEB, 2024-10-01
P:, Z:, kosmos, …
Available at https://meb-ki.github.io/meb-tensor-docs/
Holds answers to many common questions (continuously updated)
sbatch - submits batch jobs
squeue, scancel
[robkar@tensor ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
98108 core BT_K60_H hamkha R 6:16:46 1 tensor4
95475 core interact robkar R 1-19:57:08 1 tensor7
98123 core DCSM_bmi pegler R 3:36:49 1 tensor3
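Jobs can be cancelled with scancel using the JOBID shown in the first squeue column; a short sketch (the job ID and username below are taken from the listing above):

```shell
# Cancel a single job by its ID (first column of squeue output)
scancel 98123

# Cancel all of your own queued and running jobs
scancel -u robkar
```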
squeue -l, squeue -O ..., squeue --me
salloc
sbatch -c 4 -t 0-4 ...
#SBATCH -c 4
#SBATCH -t 0-4
...
-t or --time
days-hours or hours:minutes:seconds or minutes
-c N, --cpus-per-task N for N threads
--mem
nnn for nnn megabytes, or use suffixes [K|M|G|T]
seff [jobid] (after the test job finishes)
/usr/bin/time -v [your command...] (see results in slurm-$JOBID.out)
srun --jobid=[jobid] --overlap --pty htop (monitor the job while it runs)
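The flags above can be combined in a batch script header; a minimal sketch (the script contents and command are placeholders, not from the course material):

```shell
#!/bin/bash
#SBATCH -t 0-4              # time limit: 4 hours (days-hours)
#SBATCH -c 4                # 4 CPUs (threads) for the task
#SBATCH --mem=8G            # 8 gigabytes of memory

# Wrapping the command in /usr/bin/time -v records peak memory
# usage in slurm-$JOBID.out for later inspection
/usr/bin/time -v ./my_analysis   # placeholder command
```

Submit with sbatch script.sh, then check resource usage afterwards with seff [jobid].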
[robkar@tensor ~]$ salloc -t 1:00
salloc: Granted job allocation 113359
salloc: Nodes tensor4 are ready for job
[robkar@tensor4 ~]$ salloc: Job 113359 has exceeded its time limit and its allocation has been revoked.
slurmstepd: error: *** STEP 113359.interactive ON tensor4 CANCELLED AT 2024-09-25T12:00:52 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: tensor4: task 0: Killed
[robkar@tensor ~]$ seff 113359
Job ID: 113359
...
State: TIMEOUT (exit code 0)
...
[robkar@tensor ~]$ sbatch -t 0-1 -c 2 --mem=4G --wrap='ml add R; Rscript -e "x <- rnorm(1e9)"'
Submitted batch job 113367
[robkar@tensor ~]$ cat slurm-113367.out
/var/spool/slurm/d/job113367/slurm_script: line 4: 2590672 Killed Rscript -e "x <- rnorm(1e9)"
slurmstepd: error: Detected 1 oom_kill event in StepId=113367.batch. Some of the step tasks have been OOM Killed.
[robkar@tensor ~]$ seff 113367
Job ID: 113367
...
State: OUT_OF_MEMORY (exit code 0)
...
Memory Utilized: 3.00 MB
Memory Efficiency: 0.07% of 4.00 GB
Note that seff memory statistics do not always capture spikes in memory usage (but the system OOM killer will)
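When a job is OOM-killed as above, the usual fix is to resubmit with a larger --mem; a sketch (1e9 doubles need roughly 8 GB, so 10G leaves some headroom; the exact value is an assumption, not from the course material):

```shell
# Same job as before, but with enough memory for a billion doubles
sbatch -t 0-1 -c 2 --mem=10G --wrap='ml add R; Rscript -e "x <- rnorm(1e9)"'
```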
-J or --job-name
-e and -o
save your job’s error messages and output in named files
(-e and -o can point to the same file, by default "slurm-$JOBID.out")
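A sketch combining these naming options in one batch script (job and file names are placeholders):

```shell
#!/bin/bash
#SBATCH -J my_analysis          # job name shown in squeue
#SBATCH -o my_analysis.out      # stdout file
#SBATCH -e my_analysis.err      # stderr file (may be the same file as -o)

echo "running as job $SLURM_JOB_ID"
```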
--array=1-N
--dependency=...
salloc or sbatch
ml avail (or user guide)
ml load package/version
ml add mebauth for P/Z access
/scratch is machine- (and job-)specific
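Array jobs and dependencies combine naturally: submit the array, capture its job ID, and make a follow-up job wait for it. A sketch (script names are placeholders; --parsable is a standard sbatch option that prints just the job ID):

```shell
# Submit a 10-task array job and capture its job ID
jobid=$(sbatch --parsable --array=1-10 array_task.sh)

# Run the merge step only after every array task has finished successfully
sbatch --dependency=afterok:${jobid} merge_results.sh
```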
tensor1:/scratch ≠ tensor2:/scratch
/scratch is cleared when the job ends
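Because /scratch is node-local and wiped when the job ends, a common pattern is to stage input in and copy results out within the batch script; a sketch under assumed paths (all file names and the command are placeholders):

```shell
#!/bin/bash
#SBATCH -t 0-2
#SBATCH -c 2

# Stage input onto the node-local scratch disk
cp ~/data/input.csv /scratch/

# Work in /scratch for fast local I/O
cd /scratch
./process input.csv > results.csv   # placeholder command

# Copy results back before the job ends (/scratch is cleared afterwards)
cp results.csv ~/results/
```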
Running a long command as the last thing in an interactive session
some_command --param 1 --param 2 && exit
sprio to see the priority for jobs in queue
Use VDI (to ensure we all have the same environment) to
ONLY FOR TODAY: add --reservation=workshop to your sbatch commands to skip the queue
Example scripts available from